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Abstract. Structural nested models (SNMs) and the associated method 
of G-estimation were first proposed by James Robins over two decades 
ago as approaches to modeling and estimating the joint effects of a se¬ 
quence of treatments or exposures. The models and estimation methods 
have since been extended to dealing with a broader series of problems, 
and have considerable advantages over the other methods developed for 
estimating such joint effects. Despite these advantages, the application 
of these methods in applied research has been relatively infrequent; we 
view this as unfortunate. To remedy this, we provide an overview of 
the models and estimation methods as developed, primarily by Robins, 
over the years. We provide insight into their advantages over other 
methods, and consider some possible reasons for failure of the meth¬ 
ods to be more broadly adopted, as well as possible remedies. Finally, 
we consider several extensions of the standard models and estimation 
methods. 

Key words and phrases: Causal effect, confounding, direct effect, in¬ 
strumental variable, mediation, time-varying confounding. 


1. INTRODUCTION 

Structural nested models (SNMs) were designed 
in part to deal with confounding by variables af¬ 
fected by treatment (Robins (1986)). The problem 
arises when one is interested in estimating the joint 
effect of a sequence of treatments in the presence of 
a variable L with three characteristics, depicted in 
Figure 1: 
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1. It is independently associated with the outcome 
Y of interest. This can happen because (a) it is 
a direct cause of the outcome, or because (b) it 
shares unmeasured common causes with the out¬ 
come of interest. 

2. It predicts subsequent levels (Hi) of the treat¬ 
ment; 

3. It is affected by earlier treatment (Ao). 

As a motivating example, consider an observa¬ 
tional study of the effect of erythropoietin alpha 
(EPO) on mortality in a population with end-stage 
renal disease (ESRD) receiving hemodialysis. Pa¬ 
tients on dialysis tend to be anemic, as commonly 
measured via hematocrit (Hct) or hemoglobin levels. 
EPO is used to treat the anemia and stimulate the 
body’s production of red blood cells; Hct (L) thus 
satisfies covariate characteristic 3. Furthermore, pa¬ 
tients with more severe anemia (lower Hct) typically 
receive higher doses of EPO (characteristic 2), and 
sicker patients tend to be more anemic [character¬ 
istic 1(b)]. Both these characteristics 1 and 2 make 


1 




2 


S. VANSTEELANDT AND M. JOFFE 



Fig. 1. Causal diagram for time-varying treatment. 

Hct a confounder of the effect of later treatment, 
requiring adjustment to estimate the effect of EPO 
A\. Observational studies of the effect of extended 
EPO dosing on mortality will thus be characterized 
by confounding by a variable (Hct) affected by treat¬ 
ment. 

In settings like the above, where the interest lies 
in estimating the joint effect of a sequence of treat¬ 
ments, standard methods which attempt to estimate 
these effects simultaneously (e.g., regression of Y on 
Aq and A\ or some function of both) will be inap¬ 
propriate, whether or not one adjusts for or con¬ 
ditions on the confounder L. Characteristics 1(a) 
and 3 make Hct (L) an intermediate variable on the 
pathway from early EPO treatment Aq to outcome 
Y ; adjustment for it blocks the path Ho —>• L —>Y, 
making it impossible to find the part of the effect of 
early EPO treatment (Ho) mediated by Hct. Char¬ 
acteristics 1(b) and 3 make Hct (L) a so-called col¬ 
lider (Pearl (1995)) on the path Ho —> L •$— U —> Y; 
conditioning on or adjusting for it induces associa¬ 
tions between Ho and Y even if no effect of Ho on 
Y exists. 

Over an extended period of time, James Robins 
(with some help from collaborators) introduced 
three basic approaches for dealing with such con¬ 
founding: the parametric G-formula (Robins (1986)), 
structural nested models (Robins (1989); Robins 
et al. (1992)) with the associated method of G- 
estimation and marginal structural models (Robins, 
Hernan and Brumback, 2000) with the associated 
method of inverse probability of treatment weight¬ 
ing. As we will argue throughout this paper, SNMs 
and G-estimation are, in principle, better tailored 
for dealing with failure of the usual assumptions of 
no unmeasured confounders or sequential ignorabil- 
ity often used to justify the application of all of these 
methods, as well as with (near) positivity violations 
whereby certain strata contain (nearly) no treated 


or untreated subjects (Robins (2000)). Despite these 
advantages, the application of these methods in ap¬ 
plied research has been relatively infrequent. 

Broadly speaking, there are two types of SNMs: 
models for the effect of a treatment or sequence of 
treatments on the mean of an outcome, and models 
for the effect of a treatment on the entire distribu¬ 
tion of the outcome(s). The former include struc¬ 
tural nested mean models (SNMMs), which have 
close links to structural nested cumulative failure 
time models (SNCFTMs) for survival outcomes; the 
latter include structural nested distribution mod¬ 
els (SNDMs), which have close links to structural 
nested failure time models (SNFTMs) for survival 
outcomes. For pedagogic purposes, we will introduce 
these models first for point treatments (i.e., treat¬ 
ments which are administered at one specific time 
point) in Section 2. We then discuss identifying as¬ 
sumptions and the associated G-estimation method 
in Section 3, and contrast it with alternative esti¬ 
mation methods for the effect of a point treatment 
in Section 4. These results are extended to time- 
varying treatments in Sections 5 and 6. We show 
how to predict the effects of interventions in Sec¬ 
tion 7, examine extensions to mediation analysis in 
Section 8 and conclude with a discussion. 

2. STRUCTURAL MODELS FOR POINT 
TREATMENTS 

2.1 Structural Mean Models 

Let Y a denote the outcome in a given subject that 
would be seen were the subject to receive treatment 
a. This variable is a potential outcome, which we 
connect to the observed outcome through the consis¬ 
tency assumption that Y = Y a if the observed treat¬ 
ment H = a; otherwise, Y a is counterfactual. Causal 
effects can now be defined as comparisons of poten¬ 
tial outcomes Y a and T at for the same individual 
subject or group of subjects for different treatments 
a and a) (Rubin (1978); Robins (1986)). In partic¬ 
ular, letting = 0 for notational convenience, av¬ 
erage causal effects can be defined in terms of com¬ 
parisons of average potential outcomes, for exam¬ 
ple, E{Y a \ L = l,A = a) - E(Y° \L = l,A = a) or 
E(Y a | L = l,A = a)/E(Y° \ L = l,A = a). 

Structural Mean Models (SMMs) (Robins, 1994, 
2000) parameterize average causal effects in subjects 
receiving level a of treatment as 

g{E(Y a \L = l,A = a)} 

-g{E(Y°\L = l,A = a)} = 'y*(l,a;P ), 
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for all l and a. Here, g(-) is a known link func¬ 
tion (e.g., the identity, log or logit link), 7 *{l,a\ip) 
is a known function, smooth in ip and satisfying 
7 *(Z, 0; ip) = 0 for all l and ip. Here and throughout, 
ip* is the true unknown finite-dimensional parame¬ 
ter. With a = 0 encoding absence of treatment—as 
we will assume throughout—SMMs thus express the 
effect of removal of treatment on the outcome mean. 

Typically, the parameterization is chosen to be 
such that 7 *(/, a; 0 ) = 0 for all a and l, so that ip* = 0 
encodes the null hypothesis of no treatment effect. 
For instance, for scalar covariate L one may consider 
the additive or linear SMM [which uses the identity 
link g(x) = x]\ 

E(Y a \ L = l,A = a) — E(Y° \L = l,A = a) 

( 2 ) 

= Oo + iMK 


for unknown ipQ,ip*. With A a binary exposure 
coded as 1 for treatment and 0 for no treatment, 
ipQ thus encodes the average treatment effect in the 
treated with covariate value L = 0, and ip* mea¬ 
sures how much the average treatment effect in the 
treated differs between subgroups with a unit differ¬ 
ence in L. Likewise, the multiplicative or loglinear 
SMM uses the log link g(x) = log(x), for example, 


E(Y a \L = l,A = a) 
E(Y° | L = l,A = a) 


e^p{{ipo + iptl)a}, 


and the logistic SMM uses the logit link g(x ) = 
logit( 2 ), for example, 


odds(y a = l \L = l,A = a) 
odds(T° = 1 | L = l,A = a) 


ex p{(V’o + V’lO 0 }) 


where odds(H = 1 | W) = P(V = 1 | W)/P(V = 0 | 
W) for random variables V and W. If treatment A 
can take on more than two values, then—without 
additional assumptions—the function 7*(Z,a;V’*) 
cannot be interpreted simply as a dose response 
function. This is because a dose response would 
contrast outcomes in the same subset at different 
levels of a [i.e., contrast E(Y a \ L = l,A = a) with 
E(Y a | L = l, A = a) for a / a 7 ], whereas the func¬ 
tions 7 *(l,a\ip*) and 7 *{l,a'-,ip*) for a^a! contrast 
causal effects for two different groups (namely, those 
with A = a versus A = a ', but the same L = 1). We 
will revisit this subtlety in Section 7. 

One can use a SMM to construct a variable U*(ip) 
whose mean value (in a subset of individuals with 
given covariates and treatment) equals the mean 


outcome that would have been seen had treatment 
been removed from that subset. Let 

U*(iP) = Y-'y*(L,A;iP), 
if g(-) is the identity link, 

U*{ip) = Y exp{— 7 *(L, A; ip)}, 
if g(-) is the log link and 

u*M 

( 3 ) 

= expit[logit{£(y | L,A)}-y*(L,A;ip)\, 
if g(-) is the logit link. Then 

(4) E{U*(ip*)\L,A} = E(Y°\L,A). 

This identity will be central to the estimation meth¬ 
ods for ip* that we will describe in Section 3. We 
could have defined U*(ip) in general—and in par¬ 
ticular for the identity and log link—as U*(ip) = 
g~ 1 [g{E(Y \ L,A)} — 7 * (L,A; ip)]. We have avoided 
doing this for the identity and log links as it makes 
the definition of U*{ip) dependent on the expecta¬ 
tion E(Y | L,A ), which can be undesirable when 
this demands additional modeling. However, this 
(or some alternative) is unavoidable for the logit 
link. Special estimation methods will therefore be 
required for logistic SMMs. 

SMMs can also be used to describe the effect of 
a multivariate point treatment. For instance, for 
a bivariate treatment A = (A^ l \ A^)', one may 
use a SMM with 7 *(L,A]ip) = ip\A^ + ip 2 A^ + 
ipzA^ A^ to describe the effect of setting both 
treatments to zero. When primary interest lies in the 
interaction (^ 3 ) between A^ and A ^ in their effect 
on the outcome, then one may instead consider the 
class of less restrictive Structural Mean Interaction 
Models (Vansteelandt et al. (2008a); Tchetgen Tch- 
etgen (2012)). To guard against misspecification of 
the main treatment effects, these further relax the 
SMM restrictions by merely parameterising the con¬ 
trast between the effects of A^ when A^ is set to 
some value a^ versus zero (or of the effects of A ^ 
when 7b 1 ) is set to some value a ^ versus 0 ): 

g{E(Y aW ’ a(2) | A = a,L = l)} 

-g{E(Y°> a(2) \A = a, L = l)} 

(5) -g{E(Y a(1) ’°\A = a,L = l)} 

+ g{E(Y 0,0 \ A = a, L = l)} 

= 7*(Z,a (1) ,a (2) ;V’*), 
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for a = (aW,^ 2 ))'; here, 7 *(7, a^ 1 )is a known 
function which encodes the interaction between both 
treatments, and which must be smooth in ip and 
satisfy 7*(Z,0,a^ 2 ^'i/’) = V0; ip) = 0 for all 
l,a^\a^ and ip. For instance, the natural choice 
7*(/,a( 1 ),a7 2 ); ip) = ipa^a^ imposes that the inter¬ 
action between both exposures is the same at all 
levels of l. 

2.2 Structural Distribution Models 

When the outcome mean does not adequately 
summarize the data or the interest lies more broadly 
in evaluating treatment effects on the outcome 
distribution, then Structural Distribution Models 
(SDMs) can be used instead. These are closely re¬ 
lated to SMMs, but instead map percentiles y of 
the conditional distribution of Y a , given L = l and 
A = a, into percentiles 7 (y,l,a;ip*) of the condi¬ 
tional distribution of Y°, given L = l and A = a. In 
particular, they postulate that 

( 6 ) F Y o lL=lA=a {'y{y, l , a; ip*)} = F Y a\ L=lA=a {y ), 

for all l and a. As with SMMs, 7 (y, l, a; ip) is a known 
function, smooth in ip and satisfying 7 ( 2 /, 1,0] ip) = 
y for all y,l. With a = 0 encoding the absence of 
treatment, SDMs thus express the effect of removing 
treatment on the outcome distribution rather than 
the outcome mean. 

Typically the parameterisation of a SDM is chosen 
to be such that 7 ( 7 /, Z, a; 0 ) = y, so that ip* = 0 en¬ 
codes the null hypothesis of no treatment effect. For 
instance, for scalar covariate L , one could assume 
that 

^Y°\L=i,A=a(y - V a - 1p*ai) 

( 7 ) 

— F Y o.\L=l,A=a{y), 

for all l and a. This characterizes a location shift 
model following which the conditional distribution 
of y°, given L and A, can be obtained by shifting 
the conditional distribution of Y, given L and A, 
by —ipQ^ ~ ip}A-L. One can use this to construct a 
variable 

u(V) = Vy,l,A;V) 

whose distribution (in a subset of individuals with 
given covariates and treatment) is the same as that 
of the outcome that would have been seen had treat¬ 
ment been removed from that subset, in the sense 
that 

Fy°\l,a{v) = Fu(ii)*) |l,a(2/); 


for example, U(ip) = Y — ipoA — ipiAL in the location 
shift example. This will be useful for the estimation 
of ip*. 

SDMs have a stronger variant called rank preserv¬ 
ing SDMs (Robins and Tsiatis (1991)), which pos¬ 
tulate that 

y° = 7 (Y,L,A;r). 

For instance, a stronger variant of the location shift 
model of the previous paragraph assumes that Y° = 
Y—iPqA — ip}AL. By making a mapping between the 
potential outcomes themselves (rather than between 
distributions), such rank preserving SDMs are eas¬ 
ier to understand and communicate. However, they 
are seldom plausible because they impose that the 
rankings of two subjects with different outcome val¬ 
ues but identical treatment and covariates are pre¬ 
served after mapping into Y° (hence the term “rank 
preserving”). In particular, they assume that sub¬ 
jects with identical outcome, treatment and covari¬ 
ate values experience identical treatment effects. 

Location shift SDMs like (7) make substantially 
stronger assumptions than correspondingly parame¬ 
terized SMMs. The distribution models assume that 
treatment level a shifts each percentile of the con¬ 
ditional distribution of Y, given L = l, A = a by a 
value 7 *(l, a; ip*) constant for all y [i.e., 7 (y, l, o; ip) = 
V — 7*(Z, cl', ip)]i whereas the mean model assumes 
only a mean shift of 7 *(l,a\ip)- When location shift 
is implausible, it can sometimes be made more plau¬ 
sible by transforming y. For instance, for strictly 
positive y, one might obtain a location shift SDM 
by defining 7 (y,l,a;ip) = exp{log(y) - 7 (l,a;ip)}- 
There will then be a correspondence between the 
parameters of the SDM and those of a SMM for 

log(y) - log{ 7 {y,l,a\ip)}- 

The parameterization and interpretation of SDMs 
that are not simply shift models can be tricky. This 
is because, by the nature of the cumulative distri¬ 
bution function, the function y(y, l, a; ip) must be 
increasing in y for each l, a and ip , and it may be 
difficult to impose that. For instance, the function 
7 (y,a,l',ip) = y — aipi — yaip 2 may appear natural, 
but is not guaranteed increasing in y. An alterna¬ 
tive function which is naturally increasing in y is 
7 (y,a,l;ip) = yexp(— 0 ^ 2 ) ~ aipi- Here, interpreta¬ 
tion is somewhat subtle; while ip 2 expresses the ef¬ 
fect of treatment A on the residual variability of Y, 
it also has implications for the effect of treatment on 
the mean of Y. and so ip\ cannot be interpreted sim¬ 
ply as the effect of treatment on the mean outcome, 
unless ip 2 = 0 . 


( 8 ) 
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SDMs lend themselves naturally to the analysis of 
failure times. For instance, consider model ( 6 ) with 
T a ,T° and t substituting for Y a ,Y° and y. Then 
the choice y(t, a, Z; if) = texp(aifo + ahfi) implies the 
failure time model defined by 

ST°\L=i,A=a{t exp(—a'i/’o - alifl)} = S T \ L=UA=a {t), 

for all l and a, where S(-) denotes the survival func¬ 
tion. This model, which is an example of a Struc¬ 
tural Accelerated Failure Time Model (SAFTM) 
(Robins (1989); Robins and Tsiatis (1991); Robins 
(1992); Robins et al. (1992)), expresses that treat¬ 
ment lengthens lifetime by a factor exp(a^Q + aktf *) 
(in distribution) among subjects with treatment a 
and covariate l. 

2.3 Structural Mean and Distribution Models for 
Repeated Measures Outcomes 

Structural mean and distribution models require 
some modification for repeated measures outcomes. 
The modifications for SMMs are simpler, but also 
allow a new class of models for discrete-time fail¬ 
ures. Extension of SDMs is more complicated. We 
consider these in order. 

We begin with some notation common to both 
types of models. Suppose that measurements on ex¬ 
posure and confounders are collected at time point 
to and that outcome measurements are recorded at 
fixed later time points t±,... ,tx+i- Let for a vari¬ 
able X, X k denote the level of the variable that one 
obtains at time t k . We use overbars to denote the 
history of a variable; thus, X k = {Xq,X\, ... ,X k } 
denotes the history of X through t k - We use un¬ 
derbars to denote the future of a variable; thus, 
X_ k = {X k ,... ,Xk~ pi}- Finally, we use X_ as short¬ 
hand notation for X_] and X k:m for m > k to denote 
(X k j • • • j X m ). 

2.3.1 Structural mean models and structural cu¬ 
mulative failure-time models Extension of SMMs to 
repeated measures is relatively straightforward, be¬ 
cause they model separately the effect of a treatment 
on each component outcome. SMMs parameterize 
contrasts of Y_ a and T° as 

g{E(Y a \L = l,A = a)}-g{E(Y° \L = l,A = a)} 
= 7*(Z,a;V’*), 

for all l and a. Here, g(-) is a known (K + 1)- 
dimensional link function, 7 *(l,a]if) is a known 
(K + l)-dimensional function with components 
7 1(1, a; il>) 1 k = 1,..., K + 1, that parameterize the 


effect of treatment on Y k . These components are 
smooth in if and satisfy ^(1,0] if) = 0 for all l and 
if. For instance, the SMM defined by 

E(Y£ | L = l,A = a)~ E{Y% \ L = l,A = a) 

= (V’o +^iO a (4 -*o), 

for k = 1,..., K + 1, expresses that the effect of 
treatment a may depend on covariates l and changes 
linearly over time, being zero at the baseline time fo¬ 
under this repeated measures SMM, as in Sec¬ 
tion 2.1, it is possible to define a transformation 
U*(if ) of the observed outcome vector Y_ so that 

E{U*(if*)\L,A} = E(Y°\L,A). 


Here, U*(if ) is a vector with components Y k — 7 £(L, 
A;if) for k = + 1 if g(-) is the identity 

link, Yfc exp{—^ 7 fc(L, A; if)} if g(-) is the log link, and 
expit [logit {E (Yfc | L,A)} - 7 l(L,A]if)] if g(-) is the 
logit link. 

Structural Cumulative Failure Time Models 
(SCFTMs; Picciotto et al. (2012)) are a variant of 
repeated measures loglinear SMMs for the modeling 
of cumulative failure time probabilities: 


P(T a <tk\L = l,A = a) 
P(T° < t k | L = l,A = a) 


ex P(7 


for all l, a and k = 1,..., K + 1. A limitation of this 
class of models is that their parameterization can be 
tricky when the cumulative probability of failure be¬ 
comes large, because the model does not restrict the 
outcome probabilities to stay below 1. Martinussen 
et al. ( 2011 ) independently proposed a continuous¬ 
time version of this model and lay out connections 
with additive hazard models. 


2.3.2 Structural distribution models For multi¬ 
variate outcomes, SDMs parameterize the effect of a 
treatment A on the marginal distribution of the vec¬ 
tor of future potential outcomes Y_ a . This mapping 
is typically done recursively, taking the components 
Yu and Y^ in forward sequence. These models are 
therefore most easily understood by first considering 
the class of more restrictive rank-preserving SDMs, 
which postulate that, for subjects with A = a and 
L = l : 

(9) 

for k = 1,...,K + 1 . Here, 7fc(?/fc, Vk-i^h a; VO is a 
known function, smooth in if and monotonic in y^, 
and 7fc(2/fc:27fc-i^,0;?/>) = y k for all y k ,l, and 4k For 
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instance, with two time points (K = 1), a rank pre¬ 
serving SDM may be given by the following set of 
restrictions: 

Y 2 o =Y 2 -(ri+r 2 Yi)A, 

( 10 ) 

Yi =Yi- 

If the effect of A on Y 2 varies with Y± , as in this ex¬ 
ample, then one must model this explicitly since the 
model would otherwise—perhaps unrealistically— 
assume that treatment does not affect the corre¬ 
lation between repeated outcomes (conditional on 
A, L ). This is unlike in SMMs where one can average 
the effect of A on Y 2 over all Yi-values. This makes 
it substantially more difficult to parameterize SDMs 
than SMMs. It moreover complicates the interpre¬ 
tation of effects; for example, ipl in ( 10 ) is difficult 
to interpret when 7 ^ 0 since it expresses the ef¬ 
fect of treatment on Y 2 in subjects with A = 1 and 
Y\ = 0, where Y\ may itself be affected by treatment. 
Equation (10) may hence by easier to interpret upon 
re-expressing it as 

y 2 ° = y 2 - {Vb* + V >2 (Y° + r 3 A)}A 

= Y 2 - (V’l* + V’2*n° + ^A)A. 

A SDM relaxes the restrictions of the rank¬ 
preserving SDM by demanding that the equality (9) 
merely holds in distribution, conditional on L = / 
and A = a. Assuming that given L and A, Y has 
a continuous multivariate distribution with proba¬ 
bility 1, a SDM can thus be defined by the set of 
restrictions 


F Y°\L=l,A=a{7(y> h «; V’ *)} = F Y“\L=l,A=a(y) 

= FY\L=l,A=a(y ), 

for all l, a, where 

7 (y,l,a;if>*) = {71(2/1,/, a; V'*), 

72(1/2,/, a; 2A*),..., 
7K+i(y K +ii l i a ^*)} 

is given by (Robins, Rotnitzky and Scharfstein 

( 2000 )): 


71(2/1,/, a; V 7 *) = F Y^\L—l,A=a ° F Yi\L=l,A=a(.yi) 1 

7 a, (y k ,l,a;ip*) = F~ 1 _ 0 

K k ’ Y°\L=lA=aYt- 1 =li:h-i(y k - 1 ,l,a;r) 

° F Y k \L=l,A=a,Yk-l=y k -S yk ) J 


for k = 2,...,K + l. For instance, the SDM corre¬ 
sponding to ( 10 ) may be written: 

F Y°\L=l,A=a(yi - V>3 a) 

= F Y 1 \L=l,A= a (y 1 )) 

( 11 ) 

F Y 2 °\L=l,A=a,Y 1 °=y 1 -i/>*a{y2 ~ (V’l + V^Z/lW 

= F Y 2 \L=l,A=a,Yl=Vi 

The decomposition of the causal effects in the blip 
functions 7 fc(t/ fc , Z, a;'*/’*) is recursive because one 
must model not merely average effects but instead 
the full mapping between distributions. In partic¬ 
ular, the effect of treatment on the first potential 
outcome is modeled first; then, mapping between 
distributions is done successively for the outcome at 
successive times. The overall blip function encoded 
by y(-) and the first element of this function has 
the usual structure and interpretation of causal es- 
timands; that is, as a comparison of distributions 
of potential outcomes under different interventions 
for the same group of subjects. However, the com¬ 
ponent functions 7 k('),k > 1 do not in general have 
this interpretation, since the conditioning in these 
mapping functions is not common between Yj? and 
Y k ; for instance, the left-hand side of (11) conditions 
on Yj°, whereas the right-hand side conditions on Y\. 
Nonetheless, these component functions are causal 
in the sense that they represent the impact of treat¬ 
ment on the conditional distribution of a variable. 
This feature is shared with the causal rate or hazard 
ratio (Hernan, 2010). Under the strong assumption 
of rank preservation, the conditioning is on a com¬ 
mon variable, and so then the components of the 
blip function do have a standard causal interpreta¬ 
tion. 

For repeated measures outcomes, SDMs corre¬ 
spond with similarly parameterized SMMs if the 
SDMs are shift models. In a shift SDM, the com¬ 
ponent functions 7 fc( 2 /fc,I/fc-i> Z, a ) may be written as 
lk{yk,yk-Y l i a ) = yfc-7fc(^a;V0- These require that 
the shift in percentiles of the distribution of y k not 
only be independent of y k but also of Tj k _\- Thus, 
shift SDMs make substantially stronger assumptions 
than similarly parameterized SMMs. 

Under the SDM, a (K + l)-dimensional vari¬ 
able U(ip*) = {U\ (?/’*),..., Ux+ii'ip*)} can be con¬ 
structed with components U k ('p) = 7 k(Y k ,L,A;i/>). 
This vector mimics the counterfactual outcome vec¬ 
tor Y° in the sense that 

P{U(i/>*) >y\L,A} = P(Y° >y\L , A}. 

This result will be useful for estimation. 
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2.4 Retrospective Blip Models 

The blip functions and causal models discussed 
above largely consider the effect of a blip of treat¬ 
ment conditional only on treatment and baseline 
covariates; the sole exception has been SDMs for 
repeated measures outcomes, where the effect of 
treatment on later outcomes is modeled addition¬ 
ally conditional on earlier outcomes, and where the 
interpretation of the model parameters is not clear 
as a usual causal contrast. This focus is consistent 
with an orientation of the models to be more di¬ 
rectly useful for making decisions, where the effect 
of treatment is modeled conditional only on infor¬ 
mation available at the time of the decision. 

For explanatory purposes, one can construct 
structural models for the effect of a treatment con¬ 
ditional on information not available at the time of 
treatment. Such models may have explanatory uses 
even though the quantities they model are less di¬ 
rectly relevant for making decisions. Consider mod¬ 
eling the effect of screening mammography on breast 
cancer mortality (Joffe, Small and Hsu (2007)). To 
a first approximation, one might assume that the 
mammogram has an effect on death only among 
subjects for whom it detects a tumor. Suppose that 
some subjects undergo screening at the start of the 
study (A = 1; A = 0 otherwise). Let L\ indicate 1 if 
cancer is detected at time t± after the start of the 
study and 0 otherwise. It is of interest to know how 
much the screening mammogram affects mortality 
for subjects for whom it is effective in detecting can¬ 
cer. We can then model the effect of the treatment 
on the outcome using a retrospective SDM (RSDM) 
or SFTM, which conditions on L\ in addition to 
treatment and baseline covariates: 

FY°\L 0 =lo, L i=h,A=a{l(y, lo, h,a-, V’*)} 

( 12 ) 

— FY\L 0 =l 0 ,L 1 =hA=a{y) ■ 

In this example, we might assume that 7(y,/o 5 0,a; 
ip*) = y to reflect that screening has no effect in 
subjects for whom no tumor is detected. Note that 
though L\ may be affected by A, conditioning on 
it does not distort the interpretation of the param¬ 
eters as encoding a causal effect because identity 
(12) still involves a comparison of the same subjects 
(those with Lq = Iq, Li = l\, A = a) under different 
interventions. 

Models of this sort might also be useful in de¬ 
termining whether the effect of a treatment given 


at baseline is modified by post-treatment covari¬ 
ates and so whether there are identifiable sub¬ 
groups of subjects for whom treatment appears not 
to be working (Stephens, Keele and Joffe (2013)). 
Changes or additions to treatment might then be 
proposed in such subgroups after baseline. Joffe, 
Small and Hsu (2007) consider the relation be¬ 
tween these retrospective models and the popular 
approach of principal stratification (Frangakis and 
Rubin (2002)). These models can generalize to a 
sequence of time-varying treatments, where there 
are additional justifications for their use (see Sec¬ 
tion 5.3). 

3. IDENTIFICATION AND ESTIMATION IN 
STRUCTURAL MODELS FOR POINT 
TREATMENTS 

Two kinds of assumptions have been proposed for 
use in most of the literature on estimation in SMMs 
and SDMs: no unmeasured confounders and instru¬ 
mental variables type assumptions. In this section, 
we will focus on the former, and defer discussion of 
the latter to Section 6.3. 

3.1 Ignorability 

The required no unmeasured confounders assump¬ 
tion for the identification of the parameter ip* index¬ 
ing SMMs and SDMs can be formulated as 

(13) A1YY°\L, 

where U J_L V | W for random variables U, V, W 
denotes that U is conditionally independent of V, 
given W. This assumption, which is empirically un- 
verifiable, expresses that L is sufficient to adjust for 
confounding of the association between A and Y. 
Assumption (13), which is also referred to as the 
weak ignorability or exchangeability assumption, is 
weaker than the strong ignorability assumption of 
Rosenbaum and Rubin (1984) which, for binary 
treatments, states that All. (y 0 ,!^ 1 ) | L. However, 
it is generally difficult to imagine settings where as¬ 
sumption (13) holds, but strong ignorability fails 
(one exception might be settings where individuals 
choose treatment on the basis of their perceived be¬ 
lief of benefit, which may be correlated with actual 
benefit Y 1 — y°). That (13) is a weaker assump¬ 
tion is exhibited in the fact that, for binary treat¬ 
ments, it only identifies the effect of treatment on 
the treated—a contrast that has been of interest 


8 


S. VANSTEELANDT AND M. JOFFE 


in econometrics and epidemiology (Greenland and 
Robins (1986)): 

E(Y 1 — Y° \A = 1,L) 

= E(Y 1 \A = 1,L)- E(Y° | A = 1,L) 

= EpY 1 \A = 1,L)- E(Y° \A = 0,L) 

= E(Y \A = l,L)~ E(Y | A = 0,L); 

the second equality follows due to ignorability (13) 
and the third due to the consistency assumption. 
The parameters of SMMs, SDMs and SCFTMs rep¬ 
resent the effect of treatment in the treated (or, more 
generally, the effect of receiving treatment level a for 
subjects who received level a of treatment), and so 
this weaker assumption is sufficient for identifica¬ 
tion. 

It follows by a similar reasoning that the blip 
functions in the SMMs and SDMs discussed in Sec¬ 
tions 2.1-2.3 are nonparametrically just identified 
under ignorability (Robins, Rotnitzky and Scharf- 
stein (2000)). That is, the contrast of the outcomes 
under the observed treatment and the outcomes that 
would have been seen in the absence of treatment is 
computable for each level of a and l (and, for SDMs, 
of y) from the law of the observables without assum¬ 
ing any restrictions or parameterization on these 
functions. While such nonparametric identification 
is of limited use in complex settings (especially with 
time-varying treatments considered subsequently), 
due to the curse of dimensionality (Robins and Ri- 
tov (1997)), it does ensure the ability to check the 
assumptions in any assumed causal model (provided 
a sufficient sample size). In contrast, the retrospec¬ 
tive blip functions considered in Section 2.4 are not 
identified nonparametrically (Vansteelandt (2010); 
Stephens, Keele and Joffe (2013)). Multiple retro¬ 
spective blip models may thus explain the same law 
of the observables equally well even under ignorabil¬ 
ity. 

3.2 Estimation Under Ignorability 

The SMM together with the ignorability assump¬ 
tion (13) implies that 

E{U*(ip*) \L,A} = E{Y°\A,L) = E(Y°\L) 

= E{U*(ip*) | L}. 

Estimation of ip* in a SMM can thus be based on 
solving estimating equations: 

n 

0 = ^ [d*(Ai, - E{d* (Ai , Li) | Li}\ 

1=1 

■[U*(iP)-E{U*(iP) | Li}], 


which essentially set the empirical conditional co- 
variance between U*(ip) and arbitrary functions 
d*(A,L) of the dimension of ip, given L, to zero. 
For instance, for model (2), the choice d*(Ai,Li ) = 
(1 ,Li)'Ai results in estimating equations 


1=1 ' ' 

(15) • [Yi - E{Yi | U) 

- (ip 0 + ipiLi){Ai - E(Ai | Li)}}, 

from which estimates for (ipo,ipi) can be solved. 
A locally efficient estimator of ip* [under the SMM 
together with the ignorability assumption (13)] can 
be attained by setting 


d*{A, L) = E 


dU*(ip* 

dip 


A,L 


when the variance of U*(ip*) given A, L is constant; 
local here means that the efficiency is only attained 
when this constant variance assumption is met and 
models for all conditional expectations involved in 
(14) are correctly specified. 

The SDM together with the ignorability assump¬ 
tion (13) implies the more restrictive constraint that 


(16) U(ip*)±LA\L. 


This motivates estimating ip* by picking the value 
ip that makes this conditional independence hold. 
This forms the default approach in SAFTMs, where 
estimation is based on a grid search whereby the 
independence (16) is tested for different values of 
ip* using a (standard) statistical test until it is found 
to be satisfied (Robins et al. (1992)). Equivalently, 
estimation can be based on solving an estimating 
equation of the form 

n 

o =j2 d i u iW, A i, L i} 

i=1 

-EldiUiW^^L^lLMiP)} 

(17) 

-EidiUitylAitLi} 

- E[d{Ui(iP),Ai,Li} | LMiP)} | A h Li), 

for ip, where d{Ui(ip), Ai, Lj} is an arbitrary in¬ 
dex function of the dimension of ip', for example, 
d{Ui(ip), Ai, Li} = (1 ,Li)'AiUi(ip). A locally efficient 
estimator of ip* [under the SDM together with 
the ignorability assumption (13)] can be obtained 
by solving (17) with d{U{ip), A, L} = E{S^(ip) \ 
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U(ip),A,L}, where S^(ip) is the score for ip under 
the observed data likelihood 

( 18 ) *^jp-f(L)f{U(f)\L}f(A\L) 

with all components substituted by suitable para¬ 
metric models (Robins (1997)). For instance, under 
model (7) with U(ip*) given L following a normal 
distribution with mean linear in L and constant vari¬ 
ance, S^(ip) = (1, L)'A{aU(ip) + bL + c} for certain 
constants a, b, c, so that a locally efficient estimator 
is obtained by solving (15). 

Estimating equations of form (14) and (17) may 
also be used for repeated measures outcomes. In 
(14), d*(Ai,Li) now becomes a px (K + ^-dimensio¬ 
nal matrix, with p the dimension of ip. In (17), 
d{Ui(ip), Ai, Li} remains an arbitrary index func¬ 
tion of the dimension of ip] for example, d{Ui{ip), 

A i ,L i } = (l,L i )'A l J2*t 1 1 U l mW. 

Remark. Note that the SMM together with as¬ 
sumption (13) is the same model for the observables 
as the semiparametric regression model (Chamber- 
lain (1987)): 

(19) g{E(Y | L,A)} =uj(L) + 'y*(L,A]ip*), 

with lo(L) unspecified. Likewise, the SAFTM [with, 
e.g., 7 (t,a,l;ip) = texp(— aip)\ together with as¬ 
sumption (13) can be viewed as a semiparamet¬ 
ric generalization of the accelerated failure time 
model (Wei (1992)), defined by logT = ipA + e with 
e_LL A | L. 

Because of the curse of dimensionality, evaluating 
the conditional expectations appearing in equations 
(14) and (17) requires a parametric working model 
A for the conditional distribution of the exposure A: 

f(A\L) = f(A\L-a*y, 

here f(A | L; a) is a known density function, smooth 
in a, and a* is an unknown finite-dimensional pa¬ 
rameter. For instance, for dichotomous exposure, 
one could assume that P(A = 1 | L) = expit(ceg + 
a\L ) with a* = (ao,oq)'. Here, a* can be estimated 
via standard (maximum likelihood) methods. 

Evaluating (14) and (17) moreover requires a 
parametric working model B for the conditional dis¬ 
tribution of U(ip*) or the conditional expectation of 
U*(ip*). For (17), we model: 

f{u(r)\L} = f{u(r)\L-,(3*}, 


where f{U(ip*) \ L](3} is a known density func¬ 
tion, smooth in (3, and /3* is an unknown finite¬ 
dimensional parameter; to evaluate equations (14), 
specification of the conditional mean of U*{ip*), 
given L, suffices. For instance, for a continuous out¬ 
come, one could assume that conditional on L and 
for given ip*, U(ip*) = Y — ipQA — ip*AL is normally 
distributed with mean /3q + j3\L and variance /3| 2 , 
with /3* = (/3q, /?*, /3|)'. For each fixed value of ip*, 
/3* can be estimated using standard regression meth¬ 
ods. 

A consistent estimator of ip* indexing the SMM or 
SDM can now be obtained by solving equations (14) 
or (17), respectively, with a* and /3* substituted by 
consistent estimators under models A and B, respec¬ 
tively. The resulting estimator of ip* is called a G- 
estimator. In SDMs and linear or loglinear SMMs, it 
has the attractive property of being doubly robust 
(Robins and Rotnitzky, 2001): consistent when ei¬ 
ther model A or model B is correctly specified (in 
addition to a correctly specified structural model 
and ignorability); it does not require both to be cor¬ 
rectly specified, nor does it require specifying which 
of both is correctly specified. That the solution to 
equation (14) is doubly robust can be seen because 
this equation has mean zero at ip = ip* when either 
model A or model B is correctly specified, even if 
one of them is misspecified. Equation (17) is like¬ 
wise seen to have mean zero at ip = ip* under model 
£>; that it also has mean zero under model A at 
ip = ip* is seen by rewriting the equation as 

n 

0 = Y J d{U i (iP),A i ,L l } 

2=1 

-E[d{U i (iP),A i ,L i }\L i ,A i ] 

-E(d{Ui(iP),Ai,Li} 

- E[d{Ui{iP),Ai, | U, A x \ | Ui(iP),Li). 

The result now follows, provided that the param¬ 
eters a and (3 are variation-independent (i.e., not 
functionally related), so that a consistent estima¬ 
tor of a* does not require consistent estimation of 
[i* and vice versa. Sandwich standard errors are 
obtained via the usual estimating equations the¬ 
ory. 

In logistic SMMs, to the best of our knowledge, 
no estimators of ip* have been found that are root-n 
consistent under model A and the ignorability as- 
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sumption. This is because the evaluation of U*{ip) 
is anyway dependent upon a model for the condi¬ 
tional mean E(Y \ A,L ) [see (3)]. Tchetgen Tchet- 
gen, Robins and Rotnitzky (2010) show that dou¬ 
ble robustness can instead be attained against mis- 
specihcation of either a model for the density f(Y \ 
A = 0,L) or a model for the density f(A \Y = 0, L). 
Their key to estimation of ip* is that the parameter¬ 
ized association 7 *(L, A;^) between A and Y, when 
evaluated at ip = ip*, can be used to render A and 
Y conditionally independent (given L ) via inverse 
probability weighting. Their results apply equally to 
case-control designs (Tchetgen Tchetgen and Rot¬ 
nitzky (2011)). 

For Structural Mean Interaction Models, inference 
is developed in Vansteelandt et al. (2008a) when g(-) 
is the identity or log link and in Tchetgen Tchetgen 
(2012) when g(-) is the logistic link. Tchetgen Tch¬ 
etgen and Robins (2010) focus on case-only designs 
and note that when g(-) is the log link, the mul¬ 
tiplicative interaction (5) is identical to the condi¬ 
tional odds ratio between A.B) and A given L 
within the subgroup of cases. This enables the use 
of results on logistic SMMs (Tchetgen Tchetgen, 
Robins and Rotnitzky (2010)) for robust estima¬ 
tion of multiplicative interactions under outcome- 
dependent sampling. 

3.3 Censoring 

Censoring presents additional challenges for the 
analysis of failure-time outcomes T. Random censor¬ 
ing or loss to follow-up can be dealt with through 
inverse probability of censoring weighting (Robins 
et al. (1992)). Type I censoring, also known as cen¬ 
soring by end of follow-up, can be ignored in the 
analysis of SCFTMs, but must be dealt with in a 
different fashion in the analysis of SAFTMs. This is 
because U(ip*) involves the failure-time itself, which 
is missing for all subjects who fail after planned end- 
of-follow-up; the coarsening process is informative 
here as it depends on the actual failure time. We 
will next describe how Type I censoring can be dealt 
with in the analysis of SAFTMs. 

Let C denote the planned end of follow-up time 
for given individual. C is known for all subjects, 
even those observed to fail. However, U{ip) cannot 
be evaluated for those who do not fail prior to time 
C. Knowing that U(ip*) _LL A \ L under ignorability, 
the aim is then to find a function q{U{ip),C} which 


is observable for all individuals and for which 

q{TJ{ip*),C} A±A\L. 

If such function is found, then ip* can be estimated 
by solving the original estimating equations for 
SDMs with q{U(ip),C} replacing U(ip). A natural 
choice would be q{U(ip),C} = min{[7( ip),U(C, A, L; 
ip)} with U(C, A, L-,ip) the blipped-down censor¬ 
ing time, which is defined like U{ip) but with T 
substituted by C. However, this choice would not 
satisfy the required conditional independence prop¬ 
erty. The reason is that since C is fixed by de¬ 
sign, U{C,A,L\ip) will in general be a function of 
A when ip ^ 0 and so will generally fail to be con¬ 
ditionally independent of A, given L. Robins and 
Tsiatis (1991) thus propose to eliminate the de¬ 
pendence of U(C, A, L ; ip) on A by redefining it to 
be C(ip) = in\ii a {U(C, a, L; ip)}. By thus minimizing 
over all feasible treatments a, any dependence on 
the observed treatment is broken so that X{ip) = 
min {U (ip), C (ip)} and A (ip) = I{U(ip) < C(ip)} be¬ 
come always observable quantities that are indepen¬ 
dent of A given L under ignorability, when evaluated 
at ip*. We may thus choose q{U(ip),C} to be an ar¬ 
bitrary function of X(ip) and A (ip). 

With each choice of q{U(ip),C}, some subjects 
who are observed to fail may be treated as cen¬ 
sored when ip ^ 0. This can happen because for 
some subjects, C(ip) may be smaller than U(ip) 
even though T < C. Such subjects are called artifi¬ 
cially censored. Artificial censoring has several con¬ 
sequences. Besides decreasing information about ip* 
as more subjects are artificially censored, the es¬ 
timating equations are not, in general, continuous 
in ip. This is because the functions q{U(ip),C} are 
not generally continuous in ip, which happens in 
part because A [ip) is not a smooth function of ip. 
This can present problems for optimization, espe¬ 
cially when ip is a vector, and may moreover imply 
that the estimating equations have no solution in 
finite samples. This problem may be mitigated by 
choosing q{U{ip), C} to be a smooth function of ip, 
for example, q{U(ip), C} = X{ip)w a {X{ip)/C{ip)}, 
where w Q (t) = I(t > 1 — a)(l — t)/a + I{t < 1 — a) 
(Joffe, Yang and Feldman, 2012). Vock et al. (2013) 
consider functions q(-',ip) whose first derivatives ex¬ 
ist for all ip-, they appear to have had better suc¬ 
cess in convergence for their optimization algo¬ 
rithm. 
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4. PROPERTIES OF G-ESTIMATION IN 
STRUCTURAL MODELS FOR POINT 
TREATMENTS UNDER IGNORABILITY 

4.1 Comparison with Ordinary Regression 
Estimators 

Insight into the behavior of G-estimators can be 
garnered by focusing on the simple model AIsmm 
defined by the ignorability assumption that Y a _L 
_L A | L for a = 0,1, known treatment mechanism 
f(A\L) and the SMM 

E(Y a -Y° \A = a,L) = ip* a. 

Under homoscedasticity (i.e., when the conditional 
variance of the outcome, given A and L, is a constant 
a 2 ), the locally efficient G-estimator of ip* under 
model .Msmm has influence function (Newey (1990)) 

E{V&r(A | L)}~ l {A - E(A \ L)} 

( 20 ) 

• {Y - ip* A - E(Y - ip*A | L)}; 

it can thus in particular be obtained by setting the 
sample average of these influence functions to zero 
and solving for ip*. For binary treatment A, lin¬ 
ear regression adjustment for the propensity score 
(Rosenbaum and Rubin (1984)) results in an esti¬ 
mator of ip* with influence function of the same 
form (20), but with E(Y — ip*A \ L) substituted 
by the population least squares fit from a regres¬ 
sion of Y — ip*A on the propensity score E(A \ 
L). Linear regression adjustment for the propensity 
score can therefore be viewed as an inefficient and 
nondoubly robust G-estimation approach (Robins, 
Mark and Newey (1992)). The close relation be¬ 
tween G-estimation and regression adjustment for 
the propensity score is not maintained in nonlin¬ 
ear models, where propensity score adjustment may 
not only demand correct models for the propen¬ 
sity score, but also for its association with outcome 
(Vansteelandt and Daniel, 2014). In nonlinear mod¬ 
els, due to non-collapsibility of the treatment effect 
parameter (Greenland, Robins and Pearl (1999)), its 
meaning may also change depending on whether co¬ 
variates are adjusted for in addition to the propen¬ 
sity score. 

Ordinary regression estimators [in particular, 
maximum likelihood estimators obtained by fitting 
model (19) under a finite-dimensional parameteri¬ 
zation of co(L)\ are at least as efficient as the pre¬ 
viously considered G-estimators, provided correct 


model specification. From the variance of the inffu- 
ence functions, we can deduce that the asymptotic 
variance of the locally efficient G-estimator is 


v 7 E{Yav{A\L)}' 

when there is homoscedasticity and the conditional 
mean E(Y — ip*A \ L ) = E(Y \ A = 0, L) is correctly 
specified. The ordinary least squares (OLS) estima¬ 
tor under the linear regression model E(Y \ A, L) = 
(3'L + ip A has an asymptotic variance which is 
smaller but, interestingly, usually not much smaller: 


U[Var(A | L) + {E(A \ L ) - E(A | L)} 2 ]' 

This follows from its influence function, which is of 
the same form (20), but with E(A j L ) substituted 
by E(A | L), the population least squares fit from a 
regression of A on L. 

Despite their greater efficiency, ordinary regres¬ 
sion estimators have a number of limitations not 
shared by G-estimators, an important one being 
their lack of extensibility to the analysis of sequen¬ 
tial treatments (see Section 5). Furthermore, their 
explicit reliance on a model for the association be¬ 
tween outcome and covariates can be disadvanta¬ 
geous when the treated and untreated subjects are 
very different in their covariate distributions, for 
then even well-fitting models for the outcome may 
be prone to extrapolation bias (Rosenbaum and Ru¬ 
bin (1984)). This is not the case for G-estimators 
when they are based on a correctly specified model 
(^4) for the treatment process. This is also seen from 
the form of the influence functions (20), following 
which individuals in regions of little or no overlap 
[i.e., at covariate values L where Var(A | L ) is small] 
will hardly contribute in the calculation of the G- 
estimator because A — E(A | L) « 0 for such indi¬ 
viduals. As with other estimation approaches based 
on propensity score adjustment (e.g., matching), the 
information about ip* will thus come primarily from 
regions with sufficient overlap, which we view as de¬ 
sirable. In contrast, OLS estimators are more sus¬ 
ceptible to extrapolation bias since the leading term 
A — E{A j L) in their influence functions may be 
far from zero for individuals in regions of little or 
no overlap. Finally, an advantage of G-estimation 
methods is that they can incorporate a priori knowl¬ 
edge on the exposure distribution. For instance, 
Vansteelandt et al. (2008b) exploit knowledge on the 
distribution of offspring genotypes given parental 
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genotypes (based on Mendel’s law of segregation), 
by using G-estimators to develop gene-environment 
interaction tests that are robust against misspecifi- 
cation of the effect of environmental exposures on 
the outcome. 

4.2 Comparison with Inverse Probability 
Weighted Estimators 

For the analysis of sequential treatments (see Sec¬ 
tion 5), marginal structural models (MSM) (Robins, 
Hernan and Brumback, 2000) and inverse proba¬ 
bility weighted (IPW) estimators are much more 
popular than SMMs and SDMs and G-estimators. 
This is related to G-estimation being computation¬ 
ally more demanding by the lack of off-the-shelf soft¬ 
ware. It is thus of interest to compare the behaviour 
of these estimators in a simple setting with dichoto¬ 
mous treatment. Consider therefore model .MmsMj 
which is defined by the ignorability assumption that 
Y a _LL A | L for a = 0,1, known propensity score 
E(A | L) and the nonparametric MSM 

E{Y a ) = a + ip*a. 

Note, since Y a _LL A | L for a = 0,1, that ip* = 
EpY 1 - Y°) in both models AIsmm and .MmsM) 
and thus defines the same parameter. Nonetheless, 
model .Mmsm is less restrictive than model A4smm 
in that it does not postulate that the treatment ef¬ 
fect is homogeneous (i.e., constant over levels of L ). 
This explains why the asymptotic variance of the lo¬ 
cally efficient IPW estimator under model AImsMj 
which has influence function (Robins, Rotnitzky and 
Zhao (1994)) 

A{Y - E{Y \A = l, L)} 

E{A | L) 

{1-A){Y-E(Y\A = 0,L)} 

1 -E(A | L) 

+ E(Y \A = l,L)~ E(Y \A = 0,L)- ip*, 

is strictly larger than the variance of the locally effi¬ 
cient G-estimator (unless A and L are independent, 
as may be the case when A refers to a randomized 
treatment, in which case they are equally efficient). 
In particular, the asymptotic variance of the locally 
efficient IPW estimator equals 

< 22 > 

when the treatment effect is homogeneous. The dif¬ 
ference between ( 21 ) and ( 22 ) can be sizeable when 


the propensity score is close to zero or 1 for some 
values of L for then Var(^4 | L) is close to zero and 
thus 1/Var(^4 | L ) can take on large values. In our 
opinion, this difference is not usually offset by the 
weaker restrictions imposed by the MSM. Indeed, 
the marginal treatment effect would seldom be of 
scientific interest when certain subjects are almost 
precluded from receiving treatment or no treatment. 
Moreover, the G-estimator retains a useful interpre¬ 
tation even when the assumption of constant treat¬ 
ment effects fails in the sense that E(Y l — Y° \ A = 
1 ,L) = ip(L) for some function ip(L). Indeed, in that 
case the locally efficient G-estimator converges to 

E{Var(A\L)iP(L)} 

{ ’ E{Yai{A\L)} ’ 

which continues to be useful as a weighted average 
of treatment effects ip(L), with most weight given to 
strata with most information about the treatment 
effect. 

This difference in asymptotic variance between 
both estimators becomes even more pronounced in 
the likely event that the model for E(Y \ A = 0 , L) = 
E(Y° | L ) = E(Y — ip* A | L ) is misspecified. Let 
A(L) = E(Y I A = 0,L) - E*(Y \A = 0,L) denote 
the degree of misspecification at covariate value 
L, with E(Y | A = 0,L) the true expectation and 
E*(Y | A = 0 ,L) the expectation used for evaluat¬ 
ing the locally efficient G-estimator. Furthermore, 
assume that in truth the treatment effect is homo¬ 
geneous. Then the asymptotic variance of the G- 
estimator becomes 

a 2 E{V&r(A \ L)A(L) 2 } 

E{Yar(A \ L)} + E{\ai{A \ L)} 2 ’ 

and the asymptotic variance of the previously con¬ 
sidered IPW estimator becomes 


<j 2 E 


1 


+ E 


Var (A | L) 

{A(L) + ip*E(l- A\L)} 2 


Var(j4 | L) 


Consider now that model misspecification is more 
likely in regions of little overlap. Then because 
Var(R | L) ~ 0 in these regions, model misspecifi¬ 
cation in these regions will only have a minor im¬ 
pact on the variance of the G-estimator, but a par¬ 
ticularly strong impact on the variance of the lo¬ 
cally efficient IPW estimator. Similar findings have 
been noted concerning the asymptotic bias of these 
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estimators (Vansteelandt, Bekaert and Claeskens 
( 2012 )). 

While this contrast between G-estimation and 
IPW-estimation under misspecification of the out¬ 
come model could turn out to be somewhat less dra¬ 
matic when the propensity score is not considered as 
fixed and known, we believe that the above findings 
more likely understate the factual differences if one 
considers that mainstream applications are based 
on sequential treatments (and thus even more vari¬ 
able inverse probability weights) and on simple, in¬ 
efficient inverse probability weighting methods. The 
latter can be viewed as inducing extreme misspec¬ 
ification in the outcome model as they amount to 
setting E(Y \A = 1,L) = E(Y \ A = 0,L) = 0. We 
thus believe that more routine application of G- 
estimation is warranted. 

5. STRUCTURAL NESTED MODELS FOR 
TIME-VARYING TREATMENTS 

Before introducing SNMs for time-varying or se¬ 
quential treatments, we consider the structure of 
observed data in observational studies with re¬ 
peated treatments and covariates, as well as def¬ 
initions of causal effects in such setting. Suppose 
that measurements are collected at fixed time points 
to,ti,... ,tx+i- Let Ak denote the treatment pro¬ 
vided at time tk, k = 0, ..., K, and L *. denote other 
covariates measured at that time; W, the outcome 
measured at time tk, k = 1,..., K + 1, is part of L 
We presume the variables are ordered Lq, Ao, Li, 
A\ , etc.; thus, covariates and outcome at tk precede 
treatment at tk- 

Let Tm™ 1 denote the outcome that would be seen 
at time t m in a given individual were (s)he to re¬ 
ceive treatment history a m -1 through time t m -\. 
The variables Ym n ~ 1 are potential outcomes, which 
are again linked to the observed data via the consis¬ 
tency assumption that Y m = Tm " 1-1 if A m _i = a m -i ■ 
We presume that treatment at or after t m cannot 
affect outcome at times up to t m \ thus, = 

yOrn-i.om £ Qr ^ ^ajn- Causal effects can now be 
defined as comparisons of potential outcomes Y_ clK 
for the same group of subjects for different treat¬ 
ment histories ax, cl' k , Hk / a' K (Robins (1986)). If 
the outcome is measured only at the end of a fixed 
follow-up period, or only at a subset of the follow¬ 
up times, we can let Y m = (•), where denotes 
missing or undefined values for the times where the 
outcome is not measured. Most of the subsequent 
presentation then applies to those settings. 


5.1 Structural Nested Mean Models 

Structural nested mean models (SNMMs) (Robins 
(1994); Robins, Rotnitzky and Scharfstein (2000)) 
simulate the sequential removal of an amount 
(“blip”) of treatment at t m on subsequent average 
outcomes, after having removed the effects of all 
subsequent treatments. Given a history a m , define 
the counterfactual history (a m , 0 ) as the history a t 
that agrees with a m through time t m and is 0 there¬ 
after. SNMMs then model the effect of a blip of 
treatment at t rn on the subsequent outcome means 
when holding all future treatments fixed at their ref¬ 
erence level 0 ; thus, they parameterize contrasts of 
Y°^’^\ and R^j 1,0 conditionally on treatment and 
covariate histories through t m as 

5 , {L'(Z^? I L m =l m ,Am = a m )} 

~ g{E(Y%£’° | L m = l m ,A m = a m )} 

= 1m (Jm , a m ! ^P 

for each m = 0,..., K and (J m ,a m ), where 7 j^(I m , a m ; 
ip) is a known (K + 1 — m)-dimensional function, 
smooth in ip, and for each l m ,a m _ 1 and ip it is 
by definition required that 7 ^(l m ,a m _i, 0 ;^) = 0 . 
Alternatively, one may focus on the effect of treat¬ 
ment on the end-of-study outcome Y = Yk +1 only, 
in which case one obtains a SNMM of the form 

g{E(Y u ™’° \L m = l m ,Am = a m )} 

- g{E(Y°‘ rn - 1, ° | L m = l m ,A m = a m )} 

= 1m Q"m 1 a m ! V 7 ) j 

for each m = 0,..., K and (l m , a m ), where 7 ^ ( l m , o m ; 
ip) is now 1-dimensional. The above contrasts gen¬ 
eralize the notion of the effect of treatment on the 
treated to the setting of a sequence of treatments. 
The name “nested” refers to the nesting across time, 
of the subgroups defined by L m and A m within 
which the effects are evaluated. 

Typically, the parameterization is chosen to be 
such that 7 ^(l m , a m ; 0 ) = 0 for all l m ,a m so that 
ip = 0 encodes the null hypothesis of no treatment 
effect. For instance, with 2 time points (K = 1) a 
linear SNMM may be given by 

E(Y 2 (ao,ai) - Y 2 (ao ’ 0) \L 1 =J 1 ,A 1 =a 1 ) 

= (ipo + iplh +ip%a 0 )ai, 

E (y^ -Y 2 °\L 0 = l 0 ,A 0 = a 0 ) 

= (V’l + "04^0 )«0 j 

E(Y} ao ’ 0) -Y?\Lo = lo,Ao = ao) 

= C 0 O +V’*^o)ao- 
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Here, the first equation models the effect of A\ on 
Y 2 , the second models the effect of Aq on Y 2 and the 
third models the effect of Aq on Y\. all within levels 
of variables defined prior to the considered exposure. 
Thus, ' 00,^1 an d V , 2 encode short-term treatment 
effects, which are here assumed to be constant at all 
time points, and ifj 3 and ^4 encode long-term treat¬ 
ment effects. These effects are visualised in Figures 2 
and 3 below. When interest merely lies in the effect 
on the end-of-study outcome, then the above model 
for Y\ can be ignored. 

Under the SNMM, as in Section 2.1, it is possible 
to define a transformation of Y_ m+1 , whose 

mean value equals the mean that would be observed 
if treatment were suspended from time t m onward, 
in the sense that 


(24) 




— E{Y™ + 1 1 ' | L m , — a m _i, A m )i 


for m = 0,. .. ,K. Here, is a vector with com¬ 

ponents 

fc-i 

l=m 

for k = m + l,...,K+l (or for k = K + 1 only if 
interest merely lies in the effect on the end-of-study 
outcome) if g {-) is the identity link, and 

{ k -1 

~Y^tk{ L h A l^) 

l=m 

if g(-) is the log link. These equations formalize the 
notion of peeling off or blipping down the treatment 
effects over the treatment period from t m to tk-i- 
For instance, in the previous example for 2 time 
points, 

UiW) = Y 2 - W’O* + riLi + ^A 0 )A lt 

W) = (W - (V’o* + riLo)A 0 ,Y 2 



A 

^ 0 +^ 1 +^ 

V 


A 

4 > 0 + 4>2 

V 

V 



A 

jlpo+lpl+lp 


A 




V 

A 





o 


2 


2 


Fig. 2. Visualisation of the effects E(Y^ a °’ 0 ^ — Y j° | Lo = lo,Ao = ao ) and EfY^ 0,0 ’ 0 ’ 1 '* — Y.Y 0 ’ 0 ' 1 | L\ = h,Ai = ai). Lines 
within the circles depict covariate strata; lines outside the circles depict exposure strata. 
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'A 

IP3+4J4 


'A 

4^3 


.V 


Fig. 3. Visualisation of the effects -E(y} a °’ 0> — Y° \ Lq = Iq,Aq = ao). Lines within the circles depict covariate strata; lines 
outside the circles depict exposure strata. 


— (V’o + V’1-^1 + )A\ 

~ (^3 + ^lLo)A o y. 

For link functions other than the identity and log 
link, such a transformation can still be defined, but 
depends on the observed data distribution in a com¬ 
plicated and contrived way. For instance, when g(-) 
is the logit link and there are 2 time points (K = 1), 
then under the SNMM we have that 

EiY® I Lq = lo, Ao = ao) 

= 9 ~ 1 [ 9 {E(g~ 1 [g{E(Y 2 \ Li, Ai,A 0 = a 0 )} 

- 1 * 1 (L 1 ,A 1 ,A 0 = ao-,r)\ 

I Lq = lo,Ao = ao)} 

- r ro( l o,a 0 ;ip*)]- 

The calculation of F/o (VO thus not only demands 
knowledge of E[Y -2 \ L\,A\), but also of the distri¬ 
bution of (Li,Ai), given (Lo,Aq). 


The effect of a sequential treatment on the fail¬ 
ure time distribution can be parameterized through 
a collection of SNMMs with log link, one for each 
time point (Robins and Hernan, 2009; Picciotto et 
al., 2012). In continuous time (Martinussen et al. 
(2011)), such structural nested cumulative failure 
time models are defined by restrictions of the form: 

p( T am,0 >t \L m = l m ,A m = g m ,T> t m ) 
p(Ta m - 1,0 > 1 1 L m =l m ,A m = a m ,T>t m ) 

= ex P{7 

for all t and m = 0,..., K, where 7j^(t,Z m , a m ; '*/>) is 
a known function, smooth in ij; and monotonic in t, 
and 7m(Mm, a m _i,0; ip) = 0 for all t,l m ,a m -i and 

■0- 

5.2 Structural Nested Distribution Models 

Structural nested distribution models (SNDMs) 
are closely related to SNMMs, but parameterize a 
map between percentiles of the distribution of Y? m ’° 
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and percentiles of the distribution of Y k rn ~ 1 ! °. They 
are most easily understood by first considering the 
class of more restrictive rank-preserving SNDMs. In 
particular, for each exposure A m , m = 0,..., K, let 
us first consider a rank-preserving SNDM to param¬ 
eterize its effect on the end-of-study outcome Y: 

yom-i, 0 _ a m ; ip*), 

for subjects with A m = a m and L m = l m , m = 0, 
...,K. Here, 7 m(y Jm, 4; ip) is a known function, 
smooth in ip and a smooth, monotonic function of 
y, which contrasts the counterfactuals Y arn ~ 1,0 and 
h“ m ’°, and must satisfy 7 m (y,I m , a m -i, 0 ; ip) = y for 
all y and ip. For instance, with 2 time points (K = 
1) a rank preserving SNDM may be given by the 
following set of restrictions: 

Y A °’° = 'y 1 (Y,L 1 ,A l ) = Y- (iPl + 

Y° = 'y 0 {Y A °'°,Lo,A 0 ) 

= Y Ao ’° - (ipl + ip%L 0 )A 0 

= Y- rMo + A x ) - ipZi^Ai + L 0 A 0 ). 

A SNDM relaxes these restrictions by demanding 
that they merely hold in distribution, conditional on 
the observed history (i.e., L m = l m and A m = a m ). 

To describe_the effect on a repeated counterfac- 
tual future we can borrow ideas from Sec¬ 

tion 2.3.2. In particular, upon substituting A by 
A m , L by L m ,A m - i and Y k by Y^ k in the rank¬ 
preserving model (9), we obtain the identity: 

(25) Y m+k =7m,m+k(Y rr ^i :m+k Jm,a m -,1p ), 

for subjects with A m = a m and L m = l m , m = 
0, ...,K and k = 1 + 1 — m. Here, 

7 m,m+k{ym:m+kJm,a m ;ip) is a known function, 
smooth in ip and a smooth, monotonic function of 
y m +k, which contrasts the counterfactuals Y^Y 1,0 
and Y° n " A k , and must satisfy 7 m,m+k(ym:m+kJm, 
a m ~ 1 , 0 ; ip) = y m+k for all y m+k ,l m ,a m -1 and ip. For 
instance, with 2 time points (K = 1) a rank pre¬ 
serving SNDM may be given by the following set of 
restrictions: 

Yi° = 7 o,i(Yl,Lo,A 0 ;^*) 

= Y 1 -(iPI + iP* 2 L 0 )A 0 , 

(26) A 0 

Y 2 Ao ’° = 7i,2(Y 2 ,T 1 ,A 1 ;^) 

= Y 2 -(^ + ^Li)Ai, 

Y 2 ° = 7 o, 2 (Yi, Y 2 j 4 °’ 0 ,L 0 , 


= Y 2 ( 0, °) - (ipl + iplY^Ao 

(27) 

= Y 2 -{iPt+iP* 2 L l )A 1 
-(^ + ^Yi)A 0 . 

Here, the first two equations express short-term ex¬ 
posure effects, that is, the effect of Aq on Y\ and of 
A\ on Y^. The third equation expresses the effect of 
Aq on Y 2 (more precisely, its effect on Y^ 0,0 ). As in 
Section 2.3.2, this equation must take into account 
that the effect may be different depending on the 
outcome level at time t\\ this allows for Aq to also 
affect the dependence between Y\ and Y 2 , but ev¬ 
idently complicates interpretation. More generally, 
rank-preserving SNDMs allow for the effect of a m 
on Y m+ k , as encoded by a contrast of Y^™ k and 

YmPpk 1 ' 0 , depend on the history of treatments and 
covariates up to time t m , but additionally on the po¬ 
tential outcome history under the treatment regime 
(a m , 0 ), up to time t m +k- 1 - 
A SNDM relaxes the restrictions of a rank pre¬ 
serving SNDM by demanding that the equality (25) 
merely holds in distribution, conditional on L m = l m 
and A m = a m . Assuming that for given ( L m ,A m ), 
Y m , 1 has a continuous multivariate distribution 
with probability 1, a SNDM can thus be defined by 

j 

(28) 

“ F Y^°\L rn =i m PA m =a rn yy. m . + l)i 

for all where 7 m(y m+1 ip*) is a vec¬ 

tor with components r y m ,k(ym+i:m+k Jm, a m ; ip*) for 
k = 1,..., K + 1 — m, where the components ^ m ,k 
are defined in recursive fashion similar to in Sec¬ 
tion 2.3.2. 

Under the SNDM, a variable U m {ip*) = (U m ,m+i(ip*), 
..., Um^K+iiip*))' can be constructed which predicts 
how the outcomes past time t m would look like if 
treatment were suspended from time t m onward, in 
the sense that 


m-1 ’°IL -1 A -a {'Ymiy. 


—m +1 


, l m , dm 'i V* )} 


(29) 


P{U m {lp ) > y _ m _| L m ,A m — Q.m} 

= P 0 CT+ 1 0 > y m+1 I Lm,A m = a m ). 


This variable can be recursively obtained for m = 
K,..., 0 from 


(30) 


U, 


m,m-\-k 


w 


— 7m,m+fc{(^m+1? Kn+Lm+fc (VO) ? Lmi -A-m] V0-, 
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for k = 1,..., K + l — mn, where we define U m+ i }in+ h(ip) 
to be empty for k = 1. For instance, in the SNDM 
that assumes the identities in (26) hold in distribu¬ 
tion (conditional on the observed history), we have 
that 

= U 1>2 (iP) = 71 , 2 ( 12 , L^ArA) 

= Y 2 - (ip! +'i/j 2 L 1 )A 1 , 

U 0 (lP) = (U 0 ,!(lP),U 0 ,2(lP)) 

= (7o,i(hi,F 0 , A 0 -,ip), 


5.3 Retrospective Blip Models 

Retrospective blip models have been extended to 
model the effect of a sequential treatment on a scalar 
end-of-study outcome Y = Yr'+i conditional on the 
treatment and covariate history up to end-of-study. 
Mean models take the form 

g{E(Y a ™’° | L k = l K , A k = a K )} 

(31) -g{E(Y“™~ l ’°\L K =~l K ,A K = a K )} 

= 7 m^K,a K ',1p*), 


70,2(^i, Ui :2 (ip),L 0 ,A 0 ;ip )) 

= (Yi - (ip!+ip 2 L 0 )A 0 , 

Y 2 - (ip! + ip 2 L!)A! - (ip 3 + ip A Yi)A 0 ). 

The identity (30) will be useful in estimation and 
for predicting the effect of specific interventions on 
the outcome distribution. 

Structural nested failure time models (SNFTMs) 
are a variant of SNDMs which have seen most ap¬ 
plications to date. These link percentiles from the 
conditional distributions of T arn ~ 1,0 and T am, °, con¬ 
ditional on L m ,A m =a m , and for subjects who are 
still in the risk set (say, alive) at time t m : 

ST n ™-i’ 0 \nL m =l m ^ m =a m ,T>t m {A rn (ti a m\ V’*)} 

— l ^T° : ^i’ 0 \L rn =l rn ,~A rn =a rn ,T>t rn 

for t > t m , where S(-) denotes a survival function. 
Here, 7 m (t ,l m , a TO ; ip*) is a known function, smooth 
in ip and monotonic in t, and 'y m (t,l m ,a m -i, 0; ip) = 
t for all t,l m ,a m -! and ip. For instance, the choice 
7 m (t,1m i Q"m,i Ip) — tm T (t tm) exp ((Imlp') fol t > t m 
expresses that the effect of suspending treatment a m 
at time t m is to change the residual lifetime t — t m 
with a factor exp(a m V’)- F° r this choice of model, 
one can predict among individuals who survive to 
(or through, or until) time t m what their lifetime 
would be had treatment been suspended from time 
t m onward, as 

U m (ip) = t m + ^ (tk ~ tk-i) exp(A k ip) 

k:t m <t k <T 

+ (T-t T -)exp(A tT _ip), 

where t T - denotes the largest time point in {to,, 
tx) less than T and U m (ip ) is a random variable for 
which (for t>t m ) 

P { U m (ip ) > 1 1 L m , A m = a m , T 7 t m } 

= P(T “™~ l >° > 1 1 L m , A m = a m ,T > t m ). 


where 7 m(^A'> OA 3 VO a known function, smooth 
in ip and equaling zero for all ip, Ik and Hk with 
a m = 0. Distribution models take the form: 


PY a m-i’°\L K =l K ,A K =a K {7m(2/, lKi a K\Ip )} 
^Y s mfi\L K =i K ,A k=Sk (^/)> 

where r y m (y Jk, uk] ip) is a known function, smooth 
in ip and equaling y for all ip,y,lx and ax with 
a m = 0 ; a rank-preserving version of this was pro¬ 
posed by Joffe, Small and Hsu (2007). For nonpara- 
metric identifiability, restrictions are needed on the 
functions ^(Ir, a K \ ip*) and 7 m (yjK,a K ;ip), for 
example, that they do not involve the future a m+l 
and l m+ i (Vansteelandt (2010)). 

Retrospective blip models can be useful for model¬ 
ing a dichotomous outcome (Vansteelandt (2010)). 
Under these models, identity (24) is satisfied with 
Up n (ip) being a vector with components 


-1 


K 

g{E(Y I L k ,A k )} - y^(LK,A K -iP) 

l=m 


Evaluation of Up, (ip) (which is needed to make es¬ 
timation of ip* manageable) then merely requires a 
model for E(Y \ Lr,Ar), but not for the distribu¬ 
tion of treatment and covariates at each time. The 
parameters indexing these models are nonetheless 
more limited than the parameters indexing SNMMs 
in that they cannot be used by themselves for mak¬ 
ing treatment decisions prior to the end-of-study 
time, unless one integrates over the distribution of 
covariates subsequent to rn (see, e.g., Vansteelandt 
( 2010 )). 


6. IDENTIFICATION AND ESTIMATION IN 
STRUCTURAL NESTED MODELS FOR 
SEQUENTIAL TREATMENTS 

This section sketches identifying assumptions and 
inferential methods for sequential treatments. Under 
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instrumental variables assumptions sketched in Sec¬ 
tion 6.3 and under the future ignorability assump¬ 
tions sketched in Section 6.2, inferential methods 
have been developed for SNMs, but these assump¬ 
tions do not suffice for the identification of marginal 
treatment effects, and hence parameters indexing 
MSMs. The broader array of useful identifying as¬ 
sumptions thus constitutes an important advantage 
of SNMs. 

6.1 Sequential Ignorability 


Sequential ignorability (32) together with identity 
(29) moreover implies that 

U m {ip ) -LL A m | L m: A m -l 

for all m under the SNDM. This conditional inde¬ 
pendence restriction suggests that the parameter in¬ 
dexing a SNDM can be solved from 

n K 

°=EE dm { Uim (VO 5 A-im Lim} 

1 = 1 771=0 


The assumption of ignorable treatment assign¬ 
ment can be generalised to sequential treatments as 
follows: 


(32) 


I T A _ 77 

- | ^mi -rt-m—l — (l m— 1, 


for m = 0,_ ,K. This assumption has been called 

variously “no unmeasured confounders assump¬ 
tion,” “sequential ignorability,” “sequential random¬ 
ization” or “exchangeability.” It expresses that at 
each time t m . the observed history of covariates L m 
and exposures A m -i includes all risk factors of A m 
that are also associated with future outcomes. 

This assumption together with identity (24) imply 
that 


E\d m \Ui m (lp'), Ai mi Lim} | Tj m , Aim] 

(34) - E{d m {U im {iP) i Ai m , 

E[dm { Ui m {ip ), Ai m , Li m } | Li m ,Ai m ] | 

Uim {y} j Ui m i A-i^m, — 1 ) i 

where the index functions d m {Ui m (4’), Aj. m , Li m } 
must be of the dimension of ip. When the previ¬ 
ous outcome is included in the confounder history 
(i.e., Li m includes Tj m ), then local semiparametric 
efficiency is obtained upon choosing 

dm\Uim(dP ')) ^- 7771 ) Lim } 

— E{S^) | Uim(l/j') , Ai m , Lim} , 


E{Um(r) I L m ,A m } = E{U m {*P*) I L m ,A m ^} 

for all m under a SNMM. The parameter ip* index¬ 
ing a SNMM can therefore be estimated by solving 


where S^{ip) is the score for ip under the observed 
data likelihood 

f{YK+i,L K ,A K ) 


n K 

o^EEp A-im) 

i =1 m= 0 

(33) E{d m (Lim: A-im) \ Limi 1}] 

X [UimW ~ E{U im ^) | 

Limi Ai m—1 }], 

where d m (Lj m , Aim), fn = 0,..., K is an arbitrary 
p x ( K + 1 — m)-dimensional function, with p the 
dimension of ip. This estimating equation essentially 
sets the sum across time points m of the condi¬ 
tional covariances between Ui m (i’) and the given 
function d m {Li m , A im ), given L im ,A it m- 1 , to zero. 
When the previous outcome is included in the con- 
founder history (i.e., Lj m includes T) m ) and there 
is homoscedasticity [i.e., when the conditional vari¬ 
ance of UimipP*) given Li m , Ai m is constant for m = 
0,..., K], then local semiparametric efficiency under 
the SNMM is attained upon choosing 


dm (Limi Aim ) — E 


dUmW 

dip 


Limi Ar 


= f{U 0 (iP*)} 


K 

n 

771=0 


f{L m [ L m -hA m _i, U m {ip*)} 


x f{A m | L m ,A m -i,U m (ip*)} 

_ dUm(r) 
dU m +l(V) _ 


with all components substituted by suitable para¬ 
metric models (Robins (1997)); here, the term 

f {A m | L mi A m — i, U m {' l P )} — f {A m | L m ,Am—l) Un 

der sequential igorability, and thus can be ignored. 
This likelihood formulation is of interest in itself 
because it enables specifying the joint distribution 
of the variables in a way that is consistent with the 
sharp null hypothesis of no effect under the assump¬ 
tion of sequential ignorability, even in the presence 
of confounding by variables affected by treatment, 
which turns out more difficult with standard pararn- 
eterisations (Robins (1997)). 
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Solving estimating equations (33) and (34) re¬ 
quires a parametric model A for the conditional dis¬ 
tribution of the exposure A m for m = 0,..., K: 

f(A m | L m — i^Am —1 ) — / ( A m | L m — i, A m —ii a ), 

where f(A m | L m -i,A m -i',a) is a known density 
function, smooth in a, and a* is an unknown finite¬ 
dimensional parameter which can be estimated via 
standard maximum likelihood. In addition, it re¬ 
quires a parametric model B for the conditional 
mean (or distribution) of [or U m {^*)\ for 

m = 0,...,K: 

/{f7 m (^)|L m ,A m _!} 

where f{U m (^*) I is a known density 

function, smooth in 7 and 7 * is an unknown finite¬ 
dimensional parameter. As before, when the param¬ 
eters a and 7 are variation-independent, then so- 
called G-estimators that solve (33) and (34), ob¬ 
tained upon substituting a* and 7 * by consistent es¬ 
timators, are doubly robust (Robins and Rotnitzky, 
2001): consistent when the SNM and either model 
A or model B is correctly specified, regardless of 
which. This double robustness property of the G- 
estimator is desirable for various reasons. First, it 
provides justification for using simple models for the 
multivariate distribution /{Umi'ip*) \ L m ,A m _ 1 } or 
even setting E{U im (ip) \ L im ,A i:m _ 1 } = 0 in (33) for 
computational convenience. Second, while alterna¬ 
tive proposals that rely on correct specification of 
model B (see, e.g., Almirall, Ten Have and Murphy 
(2010); Henderson, Ansell and Alshibani (2010)) 
tend to give more efficient estimators (under correct 
model specification), the concern for misspecffica- 
tion of model B may be considerable in view of the 
aforementioned difficulty of postulating this model. 
This distribution can indeed be difficult to specify in 
view of its multivariate nature, the fact that U m (i/j*) 
represents a transformation of the observed data 
and that it may moreover share the same outcome 
over multiple time points, so that the models for 
U m (r) corresponding to different time points may 
not be congenial at all times. This concern can be 
overcome by inferring the conditional expectations 
E{U m (il')*) | L m ,A m ~ 1 } from models for the condi¬ 
tional distribution of L m+ \ given L m , A m -i at each 
time m (Robins, Rotnitzky and Scharfstein (2000); 
Almirall, Ten Have and Murphy (2010)). However, 
when the covariate L m is high-dimensional and/or 
strongly associated with treatment A m , specifying 
such models can be a thorny and nontrivial task. 


6.2 Departures from Sequential Ignorability and 
Sensitivity Analysis 


Specified departures from (32) can also yield iden¬ 
tification. For instance, one can allow dependence of 
treatment on a specified portion of the future poten¬ 
tial outcomes, by relaxing (32) to (Joffe, Yang and 
Feldman (2010); Zhang, Joffe and Small (2011)) 


A r , 


y a m-l>0 I T A _ 77 

- — 777 ,-(-A I Lj m', ^-m— 1 — Uj rn—l 


Y 


-1,0 


’ m+Lm+A—1 ’ 


for some integer A > 1 ; such assumptions have been 
termed future ignorability, since the independence 
at m is conditional on potential outcomes referring 
to times after m. This assumption can sometimes 
eliminate residual confounding bias, for instance, 
because the treatment process occurs in continuous 
time but confounding covariates are only measured 
intermittently, as is common in observational studies 
(Zhang, Joffe and Small (2011)), or when the future 
potential outcomes serve as proxies for other unmea¬ 
sured confounding variables (Rosenbaum (1984)). 
However, it does not lead to nonparametric iden¬ 
tification of the SNM parameters, so that inference 
becomes more dependent on correct specification of 
the causal model. 

Alternatively, deviations from sequential ignora¬ 
bility can be parameterized as 


f (Am — dm | E m — l m , 

Am-1 = ffira— = y_ rn+ i ) 


(35) 


— t\A m — dm | L m — l m i A m — i — d m —l) 


■ exp{g m (y m _^, l mi a m )} 



Em — lm j A m — 1 — d m —l) 


• exp {q m (y m+v l m , (am-i,al n ))}dal n J , 

with q m (-) known, satisfying q m (y m+1 Jm, 4 - 1 , 
a m = 0 ) = 0 for all (y m+1 Jm, 4 -i) and with t(A m \ 

Lm-, A m - 1 ) an unknown conditional density. With 
Qm(-) = 0 encoding the assumption of sequential ig¬ 
norability, the function g m (-) thus expresses the de¬ 
gree of departure from that assumption. As the data 
carry no genuine information about it, progress must 
be made by repeating the analysis with q m (-) fixed 
at different values, which are then varied over some 
plausible range (Robins, Rotnitzky and Scharfstein 
(2000)); for example, by setting q m (y m+1 ,l m ,a m ) = 
7 ?/ m+ ia m , where 7 is varied between —1 and 1 . 
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6.3 Instrumental Variables Assumptions 

When the assumption of sequential ignorability 
fails, progress can sometimes be also made using an 
instrumental variable (IV). Such variable Ho is as¬ 
sumed to satisfy 

(36) H 0 _U_Y 0 |L 0 
and 

(37) F Y°\Lo=lo,Ao=ao(y) = F Y a O’°\ Lo= i 0 ,A 0 =a 0 (y) 

for all ao,lo (Robins (1989)). Both these assump¬ 
tions together imply that the instrument Ho is not 
associated with the outcome, except through its as¬ 
sociation with subsequent treatments A m ,m > 1, 
which may affect outcome. These or similar as¬ 
sumptions have been used in adjusting for noncom¬ 
pliance in randomized trials (Robins and Tsiatis 
(1991); Mark and Robins (1993); Robins (1994)). 
With A m ,m > 1 denoting actual treatment and Ho 
denoting randomized treatment, these assumptions 
are plausible when randomization does not affect the 
outcome other than by influencing the actual treat¬ 
ment. 

Estimation under the IV assumptions can be 
based on estimating equations (33) and (34), but 
requires setting d m (L im ,A im ) = 0 and 
Ai m , Li m } = 0 for m > 0. Because of these re¬ 
strictions, root-n estimation of if’* typically re¬ 
quires additional assumptions on 7 j^(I m , Om! V’*) and 
In particular, it is commonly 
assumed that these functions are linear in a m and 
do not involve ao; moreover, time-varying covariates 
are commonly ignored, that is, L m is set empty for 
m > 0. For instance, in linear SMMs for a single 
treatment Hi (i.e., when K = 1) and dichotomous 
instrument, uj(Lq) in 

E(Y 1 - Y^ 0 | L 0 = l 0 ,A = a 1 )=u(l 0 )ai, 

is just identified. Thus residual dependencies on ao 
or nonlinear dependencies on ai cannot be identified 
unless other untestable assumptions are imposed. 

The resulting class of G-estimators contains the 
popular two-stage least squares estimator as a spe¬ 
cial case (Okui et al. (2012)). However, the frame¬ 
work of G-estimation for SNMMs and SNDMs 
has the advantage that it extends immediately to 
outcomes that do not lend themselves to linear 
modeling, for example, censored failure-time out¬ 
comes (Robins and Tsiatis (1991)) and dichotomous 
outcomes (Vansteelandt and Goetghebeur (2003); 


Robins and Rotnitzky (2004)), as well as to sequen¬ 
tial treatments (Robins and Hernan, 2009). For in¬ 
stance, when K = 1 and L\ is empty, the logistic 
SMM 

odds(y 2 ai = 1 | L 0 = lo,A 1 =di) 
oddsU^llLo^l-aO 
can be fitted by solving the SMM estimating 
equations with given by expit{logit E(Y 2 \ 

Ai,Lq ) — tpAi} [cfr. (3)] and E{Y\ \ A\,Lq) substi¬ 
tuted by the fitted value under a parametric model 
(Vansteelandt and Goetghebeur (2003); Vanstee¬ 
landt et al. (2011)). This additional model may 
sometimes not be congenial with the SMM and in¬ 
strumental variables assumptions in the sense that 
there may be no choice of parameter values index¬ 
ing this model that satisfies these assumptions. This 
can be overcome by avoiding parameterization of the 
main effect of Ho (conditional on Lq) in the model 
for E(Y\ | Ai, Lo) and instead modeling the distribu¬ 
tion of Ai, given Ho and Lq (Robins and Rotnitzky 
(2004)), or by completely saturating the parameter¬ 
ization of the main effect of Ho (conditional on Lq ) 
(Vansteelandt et al. (2011)). van der Laan, Hubbard 
and Jewell (2007) abandon logistic SMMs in favor 
of an interesting, but difficult to interpret relative 
risk parameterization. Alternatively, multiplicative 
SMMs can be used; under such models, case-only es¬ 
timators have been constructed, which remain valid 
under case-control sampling (Bowden and Vanstee¬ 
landt (2011)). 

Variant assumptions have been proposed that al¬ 
low use of time-varying instruments along with SN¬ 
MMs and G-estimation. Robins and Hernan (2009) 
consider settings in which, at each time point, there 
is a variable whose association with the outcome of 
interest may be explained solely by its association 
with prior history and its effect on some treatment 
of interest. Joffe, Yang and Feldman (2010) consider 
settings in which the conditional independence of 
treatment and future potential outcomes in (32) 
holds for only an identifiable subset {i,m} of the 
person-observations in the population rather than 
for all such observations. Treatment assignment in 
that subset may thus be considered an instrument 
for its effect and the effect of subsequent treatments. 

IV analyses have several drawbacks relative to 
those based on sequential ignorability: (1) nonpara- 
metric identification is lost, and so inference is more 
dependent on correct specification of the causal 
model; (2) decreased power and precision; and (3) 
larger finite-sample bias. 
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6.4 Censoring 

In SNFTMs, Type I censoring can be dealt with 
as previously explained by substituting U m (ip) by an 
arbitrary function of = min{C/ m (-0), C m (V ; )} 

and A m (ip) = I{U m (ip) < where 

Cm(^) = ^^{U m (C,ac,lc;^)',ac,lc e LA m (C)}, 

where LA m (C) is a given set of (acJc) histories 
which agree with the observed history of T through 
time t m or C, whichever comes first, and A through 
time t m _i or C , whichever comes first, and where 
U m (C,acJc','4’) is defined like U m {^i) in Section 5.2, 
but with C replacing T and ac and lc replacing At 
and Lt- 

7. PREDICTING THE EFFECTS OF 
INTERVENTIONS 

Identities (24) and (30) suggest using U* n ('4>*) 
and respectively, as a prediction of T^A 1,0 

among individuals with observed history A m = am- 
In particular, E(Y_° | Ho, To) = E{Uq(4>*) | Ho, To} 
in SNMMs and E{Y° \ H 0 ,T 0 ) = E{U Q {$*) \ H 0 ,T 0 } 
in SNDMs, so that the expected outcome in the ab¬ 
sence of treatment can be estimated as the sam¬ 
ple average of Uq('4>) in SNMMs and of To (VO in 
SNDMs. To estimate E(Y _ aK ) for a different treat¬ 
ment regime ax , one could use a different structural 
nested model (SNM) with ax as the reference treat¬ 
ment regime. However, when—as often—the interest 
lies in comparing the expected counterfactual out¬ 
comes between different treatment regimes, then a 
concern is that these different SNMs may fail to im¬ 
ply a coherent model. Further complications arise 
when the goal is to evaluate the expected counterfac¬ 
tual outcome following a dynamic treatment regime 
whereby the treatment at each time t m is assigned as 
a function of the treatment and covariate history up 
to that time; that is, for each m, a m = g( dm —11 ^m) • 

These complications can be overcome by sup¬ 
plementing the SNM with so-called current treat¬ 
ment interaction functions (Robins, Rotnitzky and 
Scharfstein (2000)) about which the data carry 
no information, but which enable one to trans¬ 
port treatment effects in the treated to population- 
averaged treatment effects. For instance, let K = 1 
and suppose that a SNMM has been fitted with g{-) 
the identity link. For simplicity, we focus here only 
on the effect of a nondynamic regime (do,di) at an 
end-of-study outcome Y = Y 2 ] results for dynamic 
treatment regimes are recovered upon making the 


substitutions (go(lo),gi(ao,l\)) for (do,di). Two cur¬ 
rent treatment interaction functions can be defined, 
one for each sequential treatment: 

r*{Li,ai) 

= E(Y a ° ai - y Q °° I Ho = u 0 , A\ = ai M) 

_ E(Y a oai _ Y a 0 0 | A 0 = ao,A 1 ^a l ,L 1 ), 

r* 0 (Lo,ai) 

= E(Y a ° ai — y° | H 0 = a 0 , T 0 ) 

— E (Y a ° ai — y° | H Q 7^ a 0 , L q ). 

These express how much the effects of subsequent 
treatment at m [i.e., a\ and (ao, a\) at times 1 and 0, 
resp.] differ between groups that received that level 
of treatment at m and those that did not. Under 
the SNMM, it is easily deduced from knowledge of 
rl(Li,ai ) and rQ(To,di) that E(Y a ° ai — Y° \ A$ = 
ao, To) equals 

E(Y aoai -Y ao ° \A 0 = d 0 , L 0 ) 

+ E(Y ao ° — y° | H 0 = a 0 , T q ) 

= E{ 1 \(L 1 ,a 1 -r) 

- ri(Ti,ai)P(Hi ^ ai | A 0 = a 0 ,Ti) | 

Ho = ao, To} 

+ 7o *(To,ao;V’*). 

Because E(Y a ° ai — y° | To) moreover equals 

E (Y a ° ai — y° | H Q = d 0 ,T 0 ) 

- ^(To, Oi)-P(H 0 / do I To), 

we thus obtain that E(Y a ° ai ) = E(Y a ° ai — Y°) + 
E(Y°) equals 

E[E{^(Li,a i; r) 

-r}(Ti,di)P(Hi y^di | H 0 = n 0 ,Ti) | 

Ho = do, To} 

+ 7o(T 0 ,d 0 ;'i/’*) 

- rS(T 0 , di)P(H 0 / d 0 | T 0 ) + W)]- 

When there is no current treatment interaction [i.e., 
r}(/i, di) = Tq(/o, di) = 0 for all do, di, Iq, Zi], we thus 
have that 

T/(y a ° a i) 

= P[P{7i(Ti,di;^*) | H 0 = n 0 , T 0 } 

+ 7o(T 0 ,a 0 ; V’*) + ^0 W’*)]- 
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While the components 7*(Li,ai; ip*), ^(Lq^q^iP*) 
and Uq(iP*) can be estimated along the lines de¬ 
scribed in previous sections, a complication is that 
a model for the distribution of L\, conditional on 
Ho, To, is needed to evaluate this; this can be cum¬ 
bersome when L\ is high-dimensional. This com¬ 
plication is avoided in simple structural models 
in which there is no effect modification by post¬ 
treatment variables [i.e., 7*(Ti,ai) is not a function 
of L\] and nondynamic regimes are considered. 

The assumption of no current treatment interac¬ 
tion is satisfied under a mild strengthening of se¬ 
quential ignorability such that 

A m -LT ^l m +l I A m — 1 — lj 

for all m and all treatment histories ax- It is like¬ 
wise sometimes satisfied under a mild strengthening 
of the instrumental variables assumption (36) such 
that for all treatment histories ax- 

A 0 ±±Y“«\L 0 , 

and a mild strengthening of the structural model 
such that, for instance, for binary Ai (0/1): 

E (Y a 0°! - \A 1 = a 1 ,A 0 = ao, To) 

= 7i(°i,- L o;V’*)(ai - 4)> 

for all ai,a{. Following the instrumental variables 
assumptions, Y a °° and Y a ° l should then be inde¬ 
pendent of Ho, given To, which respectively implies 
that 

E{Y-^(0,L 0 -,r)Ai\A 0 ,L 0 } 

= E{Y — 7?(0, T 0 ; 'ip*)A\ | T 0 }, 

£{T —7i*(l,T 0 ;Vn(l-^i)l A),T 0 } 

= £{y —7i*(l,T o; r)(l-^i)|To}. 

It follows from this that 7*(0,To;V J *) = — '7^(1, To; 
ip*), and thus again that the no current treat¬ 
ment interaction assumption is satisfied (Hernan 
and Robins, 2006). 

8. DIRECT AND INDIRECT EFFECTS 

SNMs parameterize the effects of treatment at 
each time with subsequent treatments set to some 
reference level. These effects can be viewed as con¬ 
trolled direct effects (Robins and Greenland (1992)), 
controlling all subsequent treatments at their refer¬ 
ence levels. The formalism of SNMs is therefore more 
widely applicable for inferring the controlled direct 


effect of some target exposure Aq on an outcome Y, 
other than through some mediator Hi (e.g., the di¬ 
rect effect of the FTO gene on the risk of myocardial 
infarction other than via body mass). In particular, 
in the SNMM 

E(Y- y ao ° \A l = a 1 ,L 1 ) = 7i*(ai, Lr,ip*), 
E{Y a o° -y°Mo = a 0 ,T 0 ) = 7oK W), 

7o(ao, To; ip*) encodes the controlled direct effect of 
setting Aq to zero, controlling A\ at zero uniformly 
in the population. However, caution is warranted be¬ 
cause 7 q(oo, To; ip*) may not encode the controlled 
direct effect of setting Ho to zero, when controlling 
A\ at some value a± ^ 0 (Robins and Wasserman 
(1997)). From knowledge that 70(00,^0!^*) f° r a ll 
ao, lo) one th us cannot deduce that Ho has no direct 
effect on Y (other than via Hi). Robins (1999) there¬ 
fore proposed directly parameterizing the controlled 
direct effect as 


E(Y aoai -Y 0ai | H 0 = a 0 ,T 0 ) 

(38) 

= m(a 0 ,a 1 ,L 0 ;ip ), 

where m(aQ,a\,Lo]ip) is a known function, smooth 
in ip, which satisfies m(0,ai,lo',ip) = 0. In contrast 
to SNMMs, (38) parameterizes only the effect of ao; 
in (38), ai may, however, be a modifier of the effect 
of a o- 

Since model (38) for fixed ai is a SMM for the 
counterfactual outcome Y ai , the techniques of Sec¬ 
tion 3 would be applicable to estimate ip* if Y ai were 
observed for each subject. Since Y ai is only observed 
for individuals with exposure level ai, Robins (1999) 
proposed treating subjects who receive a level of Hi 
other than ai as censored and, assuming sequen¬ 
tial ignorability, to inversely weight the data by the 
density /(Hi | Ho,Ti) to control resulting selection 
bias. This amounts to solving ip from an estimating 
equation of the form 


0 = 


n 

-T 

n 7—' 


f(An | Hjo,Tji) 


^w[rf(^4jo,Tjo) 


(39) 


— E{d(Aio, Tjo) | Tjo}] 


x [Yi - m(Hji, Tj 0 ; ip) 


- E{Yi - m(Aii,L i0 -ip) | L i0 }\, 


where d(H,o, T,o) is an arbitrary index function. 
More efficient and doubly robust estimators have 
been reported elsewhere (Goetgeluk, Vansteelandt 
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and Goetghebeur (2008)), as well as extensions to 
time-varying treatments (Robins (1999)). 

Ignorability assumptions can be violated even in 
randomized trials (and Mendelian randomization 
studies), where assumption (36) is guaranteed by 
design, but the processes underlying the evolution 
of subsequent mediators may be poorly understood. 
Robins and Greenland (1994) avoid ignorability as¬ 
sumptions concerning the mediators by using ini¬ 
tial randomization (or more generally, instrumental 
variables assumptions) to estimate controlled direct 
effects with SNFTMs. One can also use these ap¬ 
proaches with SNMMs or SNDMs (e.g., Ten Have 
et al. (2007)) and, in principle, in the presence of 
multiple mediators. 

SMMs have also been developed for so-called nat¬ 
ural direct effects (Robins and Greenland (1992); 
Pearl (2001)). With Y a ° A ° denoting the counterfac- 
tual outcome if Aq were set to oo and A\ to the 
counterfactual level A® that A\ would take if Aq 
were set to zero, these are defined by contrasts be¬ 
tween Y a ° A ° and Y ® A i for some ao ^ 0. Because A® 
may often reflect a natural level of A± (as in the ab¬ 
sence of treatment) which differs between subjects, 
natural direct effects may have a more appealing 
interpretation than controlled direct effects. They 
moreover correspond with a measure of natural in- 

A a 0 

direct effect in terms of contrasts between Y a ° A i 
and Y a ° A ° for some oo 7^ 0. SMMs for natural direct 
effects have been considered van der Laan and Pe¬ 
tersen (2008) and Tchetgen Tchetgen and Shpitser 
(2011). Such models are defined by 

E(Y a ° A °-Y 0 A °\A 0 = a 0 ,L 0 = l 0 ) 

(40) 

= m(a 0 J 0 ;i’ ), 

for each where m(ao, Lo;ip) is a known func¬ 

tion, smooth in which satisfies m(0, Lq;iI>) = 0. 
Extensions to sequential treatments or mediators 
have so far not been developed in view of difficulties 
of identification in such settings. 

9. CONCLUDING REMARKS 

Structural nested models were designed in part to 
deal with confounding by variables affected by treat¬ 
ment. These models maintain close resemblance to 
ordinary regression models by parameterizing con¬ 
ditional treatment effects. However, in contrast to 
these, they avoid conditioning on post-treatment 
variables by modeling the outcome at each time 


conditional on the treatment and covariate history 
up to that time; they do this after having removed 
the effects of later treatments so as to disentangle 
the unique contributions of each treatment at each 
time. The associated method of G-estimation has 
close resemblance to ordinary regression methods 
because it realizes control for measured confounders 
through conditioning. In spite of these strong con¬ 
nections with popular estimation methods, SNMs 
and G-estimation have not become quite as popular 
as MSMs and the associated IPW methods (Robins, 
Hernan and Brumback, 2000). 

The lack of popularity of G-estimation is largely 
related to the fact that it cannot usually be per¬ 
formed via off-the-shelf software; however, note that 
SAS and Stata macros for SNFTMs and SNCFTMs 
are available at http://www.hsph.harvard.edu/ 
causal/software/. This lack of popularity is ad¬ 
ditionally related to difficulties in solving the esti¬ 
mating equations in the analysis of censored survival 
times using SNFTMs. These difficulties can now be 
overcome by using the newer class of SNCFTMs 
instead (Picciotto et ah, 2012; Martinussen et al. 
( 2011 )). 

In spite of these limitations, SNMs and G-estimation 
allow for greater flexibility than MSMs and typically 
yield better performing estimators (see Section 4.1). 
This is especially so when handling continuous ex¬ 
posures or when handling a binary exposure that 
is strongly correlated with subject characteristics 
(e.g., when the treated and untreated are very dif¬ 
ferent in terms of subject characteristics). In the 
latter case, IPW estimators will typically have a 
poor performance, reflecting the lack of information 
about the treatment effect in strata where most/all 
subjects are treated or untreated. In contrast, be¬ 
cause SNMs parameterize treatment effects condi¬ 
tionally on covariates, nonsaturated models allow 
for borrowing of information, so that G-estimators 
can pool the treatment effects across strata, as in 
expression (23), downweighing those strata where 
information on treatment effect is lacking. SNMs 
can also incorporate effect modification by time- 
varying covariates. As such, a saturated SNM en¬ 
codes all possible causal contrasts on the consid¬ 
ered scale, in contrast to MSMs which average the 
effects across (time-varying) covariates, thereby di¬ 
luting the effects when effect heterogeneity exists on 
the considered scale. SNMs can moreover make use 
of instrumental variables. 
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G-estimation is not to be confused with G- 
computation (Robins (1986)), which involves stan¬ 
dardizing the predictions from an outcome model 
corresponding to the considered treatment regime, 
relative to the confounder distribution in the pop¬ 
ulation. Up to recently, also this approach has re¬ 
ceived little attention in practice because it is com¬ 
putationally intensive and because correct specifica¬ 
tion of models for the distribution of the (possibly 
high-dimensional) confounders can be a thorny is¬ 
sue in practice. These concerns, which also relate to 
likelihood-based inference under SNDMs (Robins, 
Rotnitzky and Scharfstein (2000)), can be some¬ 
what mitigated by summarising the confounders at 
each time by a longitudinal propensity score de¬ 
fined as the probability of treatment at that time, 
given the history of confounders at that time (Achy- 
Brou, Frangakis and Griswold (2010)). However, 
this may demand correct specification of propensity 
score models in addition to a model for the out¬ 
come at each time. G-computation moreover does 
not enable a transparent parameterization of the ef¬ 
fect of a particular treatment regime on the outcome 
and may thereby imply a null paradox (Robins and 
Wasserman, 1997) according to which tests of the 
null hypothesis of no effect may be guaranteed to 
reject in large samples (Robins (1997)). However, 
recent empirical applications have turned out to be 
rather successful (Cain et al. (2011)). 

We have attempted to make the literature on 
structural nested models and G-estimation more ac¬ 
cessible, while also giving pointers to the related lit¬ 
eratures on effect modification and mediation. Vari¬ 
ants of SNMs have also been developed to help iden¬ 
tify optimal sequences of treatments when treat¬ 
ments may be assigned dynamically as a function 
of previous treatment and covariate history. In such 
settings, it is more natural to model the effect of a 
blip of treatment at m on a particular utility func¬ 
tion Y, such as the outcome at the end-of-study 
time, if all subsequent treatments are optimal; that 
is, ak = al pt (Jk,ak~ i) for k > m. This can be done 
by parameterizing the so-called regrets: contrasts 

— Opt _ _ _ ODt 

of EjX a ™’Zrn +1 | L m ,A m = a m ) and E(Y a ™~ | 
L m ,A m = a m ) (Murphy (2003)). Alternatively, since 
the optimal treatment is unknown, it may be eas¬ 
ier to parameterize the effect of a blip of treat¬ 
ment at m relative to no treatment when all future 
treatments are optimal. This amounts to contrasting 
_E(yam,a m+1 | L m ,A m = a m ) and E{Y a ™~^™+i | 
L m ,A m = a m ) (Robins (2004)). We refer the reader 


to other papers in this issue for detailed accounts of 
such models. We conclude by expressing our hope 
that efforts will be continued to develop computa¬ 
tional algorithms and corresponding software pro¬ 
grams for SNMs, so as to make these methods ac¬ 
cessible to a wider audience. 
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